-
Notifications
You must be signed in to change notification settings - Fork 594
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Germline CNV WDLs for WGS #6607
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good @mwalker174! Could you also make necessary changes to the README.md file in the gatk/scripts/cnv_wdl/germline/ directory?
scripts/cnv_wdl/cnv_common_tasks.wdl
Outdated
@@ -413,99 +441,82 @@ task ScatterIntervals { | |||
} | |||
} | |||
|
|||
task PostprocessGermlineCNVCalls { | |||
task BundledPostprocessGermlineCNVCalls { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we keep it named PostprocessGermlineCNVCalls
, since it's basically doing same work as before
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, WDL tasks are pretty much 1:1 with tools and should just reflect the tool name, if possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
scripts/cnv_wdl/cnv_common_tasks.wdl
Outdated
Boolean use_ssd = false | ||
Int? cpu | ||
Int? preemptible_attempts | ||
File invariants_tar |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we change invariants
in the name here and elsewhere to something more explicit -- like bundled_gcnv_posteriors
or bundled_gcnv_outputs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. I still need to read on to figure out what exactly is being bundled, but the name is not super descriptive...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to bundled_gcnv_outputs
scripts/cnv_wdl/cnv_common_tasks.wdl
Outdated
gatk --java-options "-Xmx~{command_mem_mb}m" PostprocessGermlineCNVCalls \ | ||
$calls_args \ | ||
$model_args \ | ||
time gatk --java-options "-Xmx~{command_mem_mb}m" PostprocessGermlineCNVCalls \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the reason for calling time
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a relic from when we were optimizing for wgs. Removed time
scripts/cnv_wdl/cnv_common_tasks.wdl
Outdated
cat $calling_configs_list | sort -V > calling_configs_list.sorted | ||
cat $denoising_configs_list | sort -V > denoising_configs_list.sorted | ||
cat $gcnvkernel_version_list | sort -V > gcnvkernel_version_list.sorted | ||
cat $sharded_interval_lists_list | sort -V > sharded_interval_lists_list.sorted |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We seem to be hitting an error in our Travis gCNV WDL tests -
The number of entries in the copy-number posterior file for shard 0 does not match the number of entries in the shard interval list (posterior list size: 21, interval list size: 20)
Is it possible that some mismatch between shards is introduced here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm not sure what was happening there, especially since my own tests completed successfully. But my latest changes don't seem to be triggering this error.
scripts/cnv_wdl/cnv_common_tasks.wdl
Outdated
@@ -605,3 +616,122 @@ task CollectModelQualityMetrics { | |||
String qc_status_string = read_string("qcStatus.txt") | |||
} | |||
} | |||
|
|||
task BundlePostprocessingInvariants { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here for invariant
name as above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe some comments to indicate what this task is doing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed to BundleCallerOutputs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually in my latest commit this is now a TransposeCallerOutputs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I'm only halfway through my review, but I'm not sure I understand the need for a lot of these changes. Which changes are critical for improving performance---e.g., removing the samples x shards transpose---and which are just rearranging output?
For the transpose, did we check whether recent versions of Cromwell/Terra are OK?
It also seems like we are bundling up some things (global configs/interval lists, all gCNV calls by sample) and then splitting up others (contig ploidy calls per sample). Unless I'm misunderstanding the code, it seems that we are no longer consistent in how various quantities are split up (i.e., by shard or by sample). There are also some flattening operations that I don't quite understand. As we've discussed, we should see which of these global quantities can be scrapped---probably only need to keep the sharded intervals.
Perhaps we can discuss at today's gCNV meeting?
scripts/cnv_wdl/cnv_common_tasks.wdl
Outdated
Int machine_mem_mb = select_first([mem_gb, 7]) * 1000 | ||
Int command_mem_mb = machine_mem_mb - 1000 | ||
|
||
Boolean enable_indexing_ = select_first([enable_indexing, false]) | ||
Array[String] disabled_read_filters_arr = if(defined(disabled_read_filters)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might just put this ternary all on one line (or make the style of optional arrays like allosomal-contigs
below consistent).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
scripts/cnv_wdl/cnv_common_tasks.wdl
Outdated
@@ -413,99 +441,82 @@ task ScatterIntervals { | |||
} | |||
} | |||
|
|||
task PostprocessGermlineCNVCalls { | |||
task BundledPostprocessGermlineCNVCalls { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, WDL tasks are pretty much 1:1 with tools and should just reflect the tool name, if possible.
scripts/cnv_wdl/cnv_common_tasks.wdl
Outdated
Boolean use_ssd = false | ||
Int? cpu | ||
Int? preemptible_attempts | ||
File invariants_tar |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. I still need to read on to figure out what exactly is being bundled, but the name is not super descriptive...
scripts/cnv_wdl/cnv_common_tasks.wdl
Outdated
String genotyped_intervals_vcf_filename = "genotyped-intervals-~{entity_id}.vcf.gz" | ||
String genotyped_segments_vcf_filename = "genotyped-segments-~{entity_id}.vcf.gz" | ||
String denoised_copy_ratios_filename = "denoised_copy_ratios-~{entity_id}.tsv" | ||
|
||
Array[String] allosomal_contigs_args = if defined(allosomal_contigs) then prefix("--allosomal-contig ", select_first([allosomal_contigs])) else [] | ||
|
||
command <<< | ||
set -eu | ||
export GATK_LOCAL_JAR=~{default="/root/gatk.jar" gatk4_jar_override} | ||
set -euo pipefail |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this something that should be changed throughout all CNV WDLs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is good practice, although -o pipefail
will have no effect in this task since piping isn't used anywhere
################################################### | ||
#### arguments for PostprocessGermlineCNVCalls #### | ||
################################################### | ||
Int ref_copy_number_autosomal_contigs | ||
Array[String]? allosomal_contigs | ||
Int? disk_space_gb_for_postprocess_germline_cnv_calls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The task name in the comment section header should match the name of the WDL task.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
scripts/cnv_wdl/cnv_common_tasks.wdl
Outdated
@@ -605,3 +616,122 @@ task CollectModelQualityMetrics { | |||
String qc_status_string = read_string("qcStatus.txt") | |||
} | |||
} | |||
|
|||
task BundlePostprocessingInvariants { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe some comments to indicate what this task is doing.
scripts/cnv_wdl/cnv_common_tasks.wdl
Outdated
for (( i=0; i<~{num_samples}; i++ )) | ||
do | ||
sample_id=${sample_ids[$i]} | ||
sample_no=`printf %04d $i` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the way this is originally done in the gCNV tasks is more robust to the maximum number of samples.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, why is the renaming added to the ploidy calls here, but removed from the gCNV calls? Are those guaranteed to line up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added dynamic scaling of the sample index padding. I don't think we need sample ids in the call tars since these are only consumed internally. The sample order will be the same between the input bam array and SAMPLE_
subdirectories that gCNV creates, so they will line up.
Array[File] read_counts_entity_id = flatten(CNVGermlineCaseWorkflow.read_counts_entity_id) | ||
Array[File] read_counts = flatten(CNVGermlineCaseWorkflow.read_counts) | ||
Array[File] sample_contig_ploidy_calls_tars = flatten(CNVGermlineCaseWorkflow.sample_contig_ploidy_calls_tars) | ||
Array[File] gcnv_calls_tars = flatten(CNVGermlineCaseWorkflow.gcnv_calls_tars) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense to flatten some of these quantities?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right I shouldn't flatten sample_contig_ploidy_calls_tars
and gcnv_calls_tars
, but the rest are sample-wise and will correspond to the size/order of the input bams
OK, finally tracked down that original issue from Mehrtash concerning the bundling: #4397 As we discussed, there was a lot of back and forth to try to resolve this issue, and it was confounded by a lexicographical bug (which may have been reintroduced here). The last chapter in this saga was #5490 If the matrix transpose is still troublesome and we can avoid it by being more clever with WDL indexing, then maybe we can explore that. Or we can just see if there are analogous existing WDLs and borrow their solution. However, note that @mwalker174 indicated that the creation of the matrix itself is troublesome for call caching. If bundling is the only answer and we are willing to pay the cost of localizing all gCNV results to all shards, it might make things easier to first bundle everything up at the end of each gCNV task. Also, would Cromwell be able to handle things if we change the bundling from a) all gCNV results (i.e., across all samples and shards) to b) a single bundled global quantity (model + interval lists) + calls bundled (across shards) per sample? Each postprocessing task would then take the global bundle + the bundle containing calls for that sample. That seems like it would resolve Mehrtash's original complaint, while still minimizing the number of files whizzing around. We also discussed batching by sample at the postprocessing task level, but I think we want to keep this task at the per-sample level for parallelism. |
Also, if we can't figure this out, then I really think it's worth kicking it up to Cromwell to see if call caching can be made more scalable. |
f146f66
to
3931c8d
Compare
daeda8e
to
13317db
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mainly looked over the transpose operation. It looks pretty good!
scripts/cnv_wdl/cnv_common_tasks.wdl
Outdated
runtime { | ||
docker: docker | ||
memory: select_first([mem_gb, 2]) + " GiB" | ||
disks: "local-disk " + select_first([disk_space_gb, 150]) + if use_ssd then " SSD" else " HDD" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
runtime { | |
docker: docker | |
memory: select_first([mem_gb, 2]) + " GiB" | |
disks: "local-disk " + select_first([disk_space_gb, 150]) + if use_ssd then " SSD" else " HDD" | |
Int disk_baseline_gb_ = 10 | |
Float compression_factor = 10.0 | |
Int disk_needed_gb = disk_baseline_gb + ceil(compression_factor * size(gcnv_calls_tars, "GiB")) | |
runtime { | |
docker: docker | |
memory: select_first([mem_gb, 2]) + " GiB" | |
disks: "local-disk " + select_first([disk_space_gb, disk_needed_gb]) + if use_ssd then " SSD" else " HDD" |
I don't see memory being a problem here, but these kinds of tasks have been running of out disk on larger panels. This should be conservative enough without being super wasteful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. This would have been good but it turns out we aren't using this task. It might be useful to have automatic scaling for the rest of the gCNV tasks but I'll leave this for later work.
13317db
to
7084775
Compare
The transpose operation appears to work with Cromwell v51, so I've reverted a good chunk of the code. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mwalker174 This looks good! just a few minor comments.
What happen to the bundling performance improvement changes by the way?
padded_sample_index=$(printf "%0${num_digits}d" $i) | ||
tar -czf sample_${padded_sample_index}.${sample_id}.contig_ploidy_calls.tar.gz -C calls/SAMPLE_${i} . | ||
done | ||
>>> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Space here for consistency
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -314,7 +326,7 @@ task DetermineGermlineContigPloidyCaseMode { | |||
--mapping-error-rate ~{default="0.01" mapping_error_rate} \ | |||
--sample-psi-scale ~{default="0.0001" sample_psi_scale} | |||
|
|||
tar czf case-contig-ploidy-calls.tar.gz -C ~{output_dir_}/case-calls . | |||
tar c -C ~{output_dir_}/case-calls . | gzip -1 > case-contig-ploidy-calls.tar.gz |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain this change? Also, why is it not in cohort mode as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Faster compression with gzip -1
I believe. This is okay in case mode since the calls tars aren't usually kept in storage except as intermediates, so the tradeoff with larger file size doesn't outweigh the cost of compressing/decompressing on VMs.
The large 2D file array can be handled by the latest Cromwell versions, so we do not need to bundle. It is much more elegant and readable this way and should actually improve performance. |
disabled_read_filters
input toCollectCounts
CollectCounts
andCollectAllelicCounts